Relational Databases
Database Management System
Databases in R
RMySQL
RPostgresSQL
ROracle
DBI
Establish a connection
The first step to import data from a SQL database is creating a connection to it. As Filip explained, you need different packages depending on the database you want to connect to. All of these packages do this in a uniform way, as specified in the DBI
package.
dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIdriver
object, that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().
If the MySQL database is a remote database hosted on a server, you’ll also have to specify the following arguments in dbConnect(): dbname
, host
, port
, user
and password
. Most of these details have already been provided.
# Load the DBI package
library(DBI)
# Edit dbConnect() call
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
con
## <MySQLConnection:0,0>
Awesome! You are now connected to the MySQL database tweater
.
List the database tables
After you’ve successfully connected to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the dbListTables() function. As you might remember from the video, this function requires the connection object as an input, and outputs a character vector with the table names.
# Load the RMySQL package
library(DBI)
# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Build a vector of table names: tables
tables <- dbListTables(con)
# Display structure of tables
str(tables)
## chr [1:3] "comments" "tweats" "users"
Good! dbListTables() can be very useful to get a first idea about the contents of your database. Can you guess what kind of information this database contains?
Import users
As you might have guessed by now, the database contains data on a more tasty version of Twitter, namely Tweater. Users can post tweats with short recipes for delicious snacks. People can comment on these tweats. There are three tables: users
, tweats
, and comments
that have relations among them. Which ones, you ask? You’ll discover in a moment!
Let’s start by importing the data on the users into your R session. You do this with the dbReadTable() function. Simply pass it the connection object (con
), followed by the name of the table you want to import. The resulting object is a standard R data frame.
# Load the DBI package
library(DBI)
# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Import the users table from tweater: users
users <- dbReadTable(con, "users")
# Print users
users
## id name login
## 1 1 elisabeth elismith
## 2 2 mike mikey
## 3 3 thea teatime
## 4 4 thomas tomatotom
## 5 5 oliver olivander
## 6 6 kate katebenn
## 7 7 anjali lianja
Import all tables
Next to the users
, we’re also interested in the tweats
and comments
tables. However, separate dbReadTable()
calls for each and every one of the tables in your database would mean a lot of code duplication. Remember about the lapply()
function? You can use it again here! A connection is already coded for you, as well as a vector table_names
, containing the names of all the tables in the database.
# Load the DBI package
library(DBI)
# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Get table names
table_names <- dbListTables(con)
# Import all tables
tables <- lapply(table_names, dbReadTable, conn = con)
# Print out tables
tables
## [[1]]
## id tweat_id user_id message
## 1 1022 87 7 nice!
## 2 1000 77 7 great!
## 3 1011 49 5 love it
## 4 1012 87 1 awesome! thanks!
## 5 1010 88 6 yuck!
## 6 1026 77 4 not my thing!
## 7 1004 49 1 this is fabulous!
## 8 1030 75 6 so easy!
## 9 1025 88 2 oh yes
## 10 1007 49 3 serious?
## 11 1020 77 1 couldn't be better
## 12 1014 77 1 saved my day
##
## [[2]]
## id user_id
## 1 75 3
## 2 88 4
## 3 77 6
## 4 87 5
## 5 49 1
## 6 24 7
## post
## 1 break egg. bake egg. eat egg.
## 2 wash strawberries. add ice. blend. enjoy.
## 3 2 slices of bread. add cheese. grill. heaven.
## 4 open and crush avocado. add shrimps. perfect starter.
## 5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 6 just eat an apple. simply and healthy.
## date
## 1 2015-09-05
## 2 2015-09-14
## 3 2015-09-21
## 4 2015-09-22
## 5 2015-09-22
## 6 2015-09-24
##
## [[3]]
## id name login
## 1 1 elisabeth elismith
## 2 2 mike mikey
## 3 3 thea teatime
## 4 4 thomas tomatotom
## 5 5 oliver olivander
## 6 6 kate katebenn
## 7 7 anjali lianja
Wonderful! Now that you have an R version of all data that is contained in the database, you can dive a little deeper into the relations between the different data frames.
Selective importing
Query tweater (1)
In your life as a data scientist, you’ll often be working with huge databases that contain tables with millions of rows. If you want to do some analyses on this data, it’s possible that you only need a fraction of this data. In this case, it’s a good idea to send SQL queries to your database, and only import the data you actually need into R.
dbGetQuery() is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string. This example selects the age
variable from the people
dataset where gender equals "male"
:
dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")
A connection to the tweater
database has already been coded for you.
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Import tweat_id column of comments where user_id is 1: elisabeth
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")
# Print elisabeth
elisabeth
## tweat_id
## 1 87
## 2 49
## 3 77
## 4 77
Nice! To make sure you understood SQL’s SELECT
- FROM
- WHERE
syntax, let’s practice some more.
Query tweater (2)
Apart from checking equality, you can also check for less than and greater than relationships, with <
and >
, just like in R.
con
, a connection to the tweater
database, is again available.
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Import post column of tweats where date is higher than '2015-09-21': latest
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")
# Print latest
latest
## post
## 1 open and crush avocado. add shrimps. perfect starter.
## 2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 3 just eat an apple. simply and healthy.
Nice! To make sure you understood SQL’s SELECT
- FROM
- WHERE
syntax, let’s practice some more.
Query tweater (3)
Suppose that you have a people
table, with a bunch of information. This time, you want to find out the age
and country
of married males. Provided that there is a married
column that’s 1 when the person in question is married, the following query would work.
SELECT age, country
FROM people
WHERE gender = "male" AND married = 1
Can you use a similar approach for a more specialized query on the tweater
database?
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Create data frame specific
specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")
# Print specific and short
specific
## message
## 1 great!
Query tweater (4)
There are also dedicated SQL functions that you can use in the WHERE
clause of an SQL query. For example, CHAR_LENGTH()
returns the number of characters in a string.
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Create data frame short
short <- dbGetQuery(con, "SELECT id, name FROM users WHERE CHAR_LENGTH(name) < 5")
# Print short
short
## id name
## 1 2 mike
## 2 3 thea
## 3 6 kate
Join the query madness!
Of course, SQL does not stop with the the three keywords SELECT
, FROM
and WHERE
. Another very often used keyword is JOIN
, and more specifically INNER JOIN
. Take this call for example:
SELECT name, post
FROM users INNER JOIN tweats on users.id = user_id
WHERE date > "2015-09-19"
Here, the users
table is joined with the tweats
table. This is possible because the id
column in the users
table corresponds to the user_id
column in the tweats
table. Also notice how name
, from the users
table, and post
and date
, from the tweats
table, can be referenced to without problems.
Can you predict the outcome of the following query?
SELECT post, message
FROM tweats INNER JOIN comments on tweats.id = tweat_id
WHERE tweat_id = 77
A connection to the tweater
database is already available as con
; feel free to experiment!
dbGetQuery(con, "SELECT post, message
FROM tweats INNER JOIN comments on tweats.id = tweat_id
WHERE tweat_id = 77")
## post message
## 1 2 slices of bread. add cheese. grill. heaven. great!
## 2 2 slices of bread. add cheese. grill. heaven. not my thing!
## 3 2 slices of bread. add cheese. grill. heaven. couldn't be better
## 4 2 slices of bread. add cheese. grill. heaven. saved my day
Bellissimo! Have you tried running this query with dbGetQuery()? The result contains data both from the tweats
and the comments
column.
Send - Fetch - Clear
You’ve used dbGetQuery() multiple times now. This is a virtual function from the DBI
package, but is actually implemented by the RMySQL
package. Behind the scenes, the following steps are performed:
Let’s not use dbGetQuery() this time and implement the steps above. This is tedious to write, but it gives you the ability to fetch the query’s result in chunks rather than all at once. You can do this by specifying the n
argument inside dbFetch().
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
# Use dbFetch() twice
dbFetch(res, n = 2)
## id tweat_id user_id message
## 1 1022 87 7 nice!
## 2 1000 77 7 great!
dbFetch(res)
## id tweat_id user_id message
## 1 1011 49 5 love it
## 2 1010 88 6 yuck!
## 3 1030 75 6 so easy!
# Clear res
dbClearResult(res)
## [1] TRUE
Perfect! In our toy example, chopping up the fetches doesn’t make a lot of sense, but make sure to remember this technique when you’re struggling with huge databases!
Be polite and …
Every time you connect to a database using dbConnect(), you’re creating a new connection to the database you’re referencing. RMySQL
automatically specifies a maximum of open connections and closes some of the connections for you, but still: it’s always polite to manually disconnect from the database afterwards. You do this with the dbDisconnect() function.
The code that connects you to the database is already available, can you finish the script?
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Create the data frame long_tweats
long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE CHAR_LENGTH(post) > 40")
# Print long_tweats
long_tweats
## post
## 1 wash strawberries. add ice. blend. enjoy.
## 2 2 slices of bread. add cheese. grill. heaven.
## 3 open and crush avocado. add shrimps. perfect starter.
## 4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## date
## 1 2015-09-14
## 2 2015-09-21
## 3 2015-09-22
## 4 2015-09-22
# Disconnect from the database
dbDisconnect(con)
## [1] TRUE
Wonderful! This concludes the chapter on databases. Of course, there is tons more to learn about interfacing to databases and working with them as efficiently as possible, but that’s something for more advanced courses.
Data on the web
Import flat files from the web
In the video, you saw that the utils
functions to import flat file data, such as read.csv() and read.delim(), are capable of automatically importing from URLs that point to flat files on the web.
You must be wondering whether Hadley Wickham’s alternative package, readr
, is equally potent. Well, figure it out in this exercise! The URLs for both a .csv
file as well as a .delim
file are already coded for you. It’s up to you to actually import the data. If it works, that is…
# Load the readr package
library(readr)
# Import the csv file: pools
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(url_csv)
##
## -- Column specification --------------------------------------------------------
## cols(
## Name = col_character(),
## Address = col_character(),
## Latitude = col_double(),
## Longitude = col_double()
## )
# Import the txt file: potatoes
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_tsv(url_delim)
##
## -- Column specification --------------------------------------------------------
## cols(
## area = col_double(),
## temp = col_double(),
## size = col_double(),
## storage = col_double(),
## method = col_double(),
## texture = col_double(),
## flavor = col_double(),
## moistness = col_double()
## )
# Print pools and potatoes
pools
## # A tibble: 20 x 4
## Name Address Latitude Longitude
## <chr> <chr> <dbl> <dbl>
## 1 Acacia Ridge Leisure Centre 1391 Beaudesert Road, Acacia~ -27.6 153.
## 2 Bellbowrie Pool Sugarwood Street, Bellbowrie -27.6 153.
## 3 Carole Park Cnr Boundary Road and Waterf~ -27.6 153.
## 4 Centenary Pool (inner City) 400 Gregory Terrace, Spring ~ -27.5 153.
## 5 Chermside Pool 375 Hamilton Road, Chermside -27.4 153.
## 6 Colmslie Pool (Morningside) 400 Lytton Road, Morningside -27.5 153.
## 7 Spring Hill Baths (inner Ci~ 14 Torrington Street, Spring~ -27.5 153.
## 8 Dunlop Park Pool (Corinda) 794 Oxley Road, Corinda -27.5 153.
## 9 Fortitude Valley Pool 432 Wickham Street, Fortitud~ -27.5 153.
## 10 Hibiscus Sports Complex (up~ 90 Klumpp Road, Upper Mount ~ -27.6 153.
## 11 Ithaca Pool ( Paddington) 131 Caxton Street, Paddington -27.5 153.
## 12 Jindalee Pool 11 Yallambee Road, Jindalee -27.5 153.
## 13 Manly Pool 1 Fairlead Crescent, Manly -27.5 153.
## 14 Mt Gravatt East Aquatic Cen~ Cnr wecker Road and Newnham ~ -27.5 153.
## 15 Musgrave Park Pool (South B~ 100 Edmonstone Street, South~ -27.5 153.
## 16 Newmarket Pool 71 Alderson Stret, Newmarket -27.4 153.
## 17 Runcorn Pool 37 Bonemill Road, Runcorn -27.6 153.
## 18 Sandgate Pool 231 Flinders Parade, Sandgate -27.3 153.
## 19 Langlands Parks Pool (Stone~ 5 Panitya Street, Stones Cor~ -27.5 153.
## 20 Yeronga Park Pool 81 School Road, Yeronga -27.5 153.
potatoes
## # A tibble: 160 x 8
## area temp size storage method texture flavor moistness
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 2.9 3.2 3
## 2 1 1 1 1 2 2.3 2.5 2.6
## 3 1 1 1 1 3 2.5 2.8 2.8
## 4 1 1 1 1 4 2.1 2.9 2.4
## 5 1 1 1 1 5 1.9 2.8 2.2
## 6 1 1 1 2 1 1.8 3 1.7
## 7 1 1 1 2 2 2.6 3.1 2.4
## 8 1 1 1 2 3 3 3 2.9
## 9 1 1 1 2 4 2.2 3.2 2.5
## 10 1 1 1 2 5 2 2.8 1.9
## # ... with 150 more rows
Great! It seems to work without any additional problems!
Secure importing
In the previous exercises, you have been working with URLs that all start with http://
. There is, however, a safer alternative to HTTP, namely HTTPS, which stands for HypterText Transfer Protocol Secure. Just remember this: HTTPS is relatively safe, HTTP is not.
Luckily for us, you can use the standard importing functions with https://
connections since R version 3.2.2.
# https URL to the swimming_pools csv file.
url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
# Import the file using read.csv(): pools1
pools1 <- read.csv(url_csv)
# Load the readr package
library(readr)
# Import the file using read_csv(): pools2
pools2 <- read_csv(url_csv)
##
## -- Column specification --------------------------------------------------------
## cols(
## Name = col_character(),
## Address = col_character(),
## Latitude = col_double(),
## Longitude = col_double()
## )
# Print the structure of pools1 and pools2
str(pools1)
## 'data.frame': 20 obs. of 4 variables:
## $ Name : chr "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
## $ Address : chr "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
## $ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
## $ Longitude: num 153 153 153 153 153 ...
str(pools2)
## tibble [20 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Name : chr [1:20] "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
## $ Address : chr [1:20] "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
## $ Latitude : num [1:20] -27.6 -27.6 -27.6 -27.5 -27.4 ...
## $ Longitude: num [1:20] 153 153 153 153 153 ...
## - attr(*, "spec")=
## .. cols(
## .. Name = col_character(),
## .. Address = col_character(),
## .. Latitude = col_double(),
## .. Longitude = col_double()
## .. )
Great! It seems to work without any additional problems!
Why download.file()?
Reproducibility
HTTP from inside R
httr
- Hadley WickhamImport Excel files from the web
When you learned about gdata
, it was already mentioned that gdata
can handle .xls
files that are on the internet. readxl
can’t, at least not yet. The URL with which you’ll be working is already available in the sample code. You will import it once using gdata
and once with the readxl
package via a workaround.
# Load the readxl and gdata package
library(readxl)
library(gdata)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
## The following object is masked from 'package:stats':
##
## nobs
## The following object is masked from 'package:utils':
##
## object.size
## The following object is masked from 'package:base':
##
## startsWith
# Specification of url: url_xls
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
# Import the .xls file with gdata: excel_gdata
excel_gdata <- read.xls(url_xls)
# Download file behind URL, name it local_latitude.xls
download.file(url_xls, destfile = "_data/local_latitude.xls", quiet = TRUE, mode = "wb")
# Import the local .xls file with readxl: excel_readxl
excel_readxl <- read_excel("_data/local_latitude.xls")
Nice! It appears that readxl
is not (yet?) able to deal with Excel files that are on the web. However, a simply workaround with download.file() fixes this.
Downloading any file, secure or not
In the previous exercise you’ve seen how you can read excel files on the web using the read_excel
package by first downloading the file with the download.file() function.
There’s more: with download.file() you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also .RData
files. An RData
file is very efficient format to store R data.
You can load data from an RData
file using the load() function, but this function does not accept a URL string as an argument. In this exercise, you’ll first download the RData file securely, and then import the local data file.
# https URL to the wine RData file.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
# Download the wine file to your working directory
download.file(url_rdata, destfile = "_data/wine_local.RData")
# Load the wine data into your workspace using load()
load("_data/wine_local.RData")
# Print out the summary of the wine data
summary(wine)
## Alcohol Malic acid Ash Alcalinity of ash
## Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.87 Median :2.360 Median :19.50
## Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52
## 3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50
## Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00
## Magnesium Total phenols Flavanoids Nonflavanoid phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.740 1st Qu.:1.200 1st Qu.:0.2700
## Median : 98.00 Median :2.350 Median :2.130 Median :0.3400
## Mean : 99.59 Mean :2.292 Mean :2.023 Mean :0.3623
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.860 3rd Qu.:0.4400
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color intensity Hue Proline
## Min. :0.410 Min. : 1.280 Min. :1.270 Min. : 278.0
## 1st Qu.:1.250 1st Qu.: 3.210 1st Qu.:1.930 1st Qu.: 500.0
## Median :1.550 Median : 4.680 Median :2.780 Median : 672.0
## Mean :1.587 Mean : 5.055 Mean :2.604 Mean : 745.1
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :3.580 Max. :13.000 Max. :4.000 Max. :1680.0
Great! Another way to load remote RData
files is to use the url() function inside load(). However, this will not save the RData
file to a local file.
HTTP? httr! (1)
Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.
httr
provides a convenient function, GET() to execute this GET request. The result is a response
object, that provides easy access to the status code, content-type and, of course, the actual content.
You can extract the content from the request using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or an R object, such as a list. If you don’t tell content() how to retrieve the content through the as
argument, it’ll try its best to figure out which type is most appropriate based on the content-type.
# Load the httr package
library(httr)
# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)
# Print resp
resp
## Response [http://www.example.com/]
## Date: 2020-12-11 20:52
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 1.26 kB
## <!doctype html>
## <html>
## <head>
## <title>Example Domain</title>
##
## <meta charset="utf-8" />
## <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
## <meta name="viewport" content="width=device-width, initial-scale=1" />
## <style type="text/css">
## body {
## ...
# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")
# Print the head of raw_content
head(raw_content)
## [1] 3c 21 64 6f 63 74
Great! The raw content of the response doesn’t make a lot of sense, does it? Luckily, the content() function by default, if you don’t specify the as
argument, figures out what type of data you’re dealing with and parses it for you.
HTTP? httr! (2)
Web content does not limit itself to HTML pages and files stored on remote servers such as DataCamp’s Amazon S3 instances. There are many other data formats out there. A very common one is JSON. This format is very often used by so-called Web APIs, interfaces to web servers with which you as a client can communicate to get or store information in more complicated ways.
You’ll learn about Web APIs and JSON in the video and exercises that follow, but some experimentation never hurts, does it?
# httr is already loaded
# Get the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
resp <- GET(url)
# Print resp
resp
## Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
## Date: 2020-12-11 20:52
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 932 B
# Print content of resp as text
content(resp, as = "text")
## [1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts, Carol Kane\",\"Plot\":\"Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall.\",\"Language\":\"English, German\",\"Country\":\"USA\",\"Awards\":\"Won 4 Oscars. Another 26 wins & 8 nominations.\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"98%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"249,146\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"N/A\",\"BoxOffice\":\"N/A\",\"Production\":\"Rollins-Joffe Productions\",\"Website\":\"N/A\",\"Response\":\"True\"}"
# Print content of resp
content(resp)
## $Title
## [1] "Annie Hall"
##
## $Year
## [1] "1977"
##
## $Rated
## [1] "PG"
##
## $Released
## [1] "20 Apr 1977"
##
## $Runtime
## [1] "93 min"
##
## $Genre
## [1] "Comedy, Romance"
##
## $Director
## [1] "Woody Allen"
##
## $Writer
## [1] "Woody Allen, Marshall Brickman"
##
## $Actors
## [1] "Woody Allen, Diane Keaton, Tony Roberts, Carol Kane"
##
## $Plot
## [1] "Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall."
##
## $Language
## [1] "English, German"
##
## $Country
## [1] "USA"
##
## $Awards
## [1] "Won 4 Oscars. Another 26 wins & 8 nominations."
##
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg"
##
## $Ratings
## $Ratings[[1]]
## $Ratings[[1]]$Source
## [1] "Internet Movie Database"
##
## $Ratings[[1]]$Value
## [1] "8.0/10"
##
##
## $Ratings[[2]]
## $Ratings[[2]]$Source
## [1] "Rotten Tomatoes"
##
## $Ratings[[2]]$Value
## [1] "98%"
##
##
## $Ratings[[3]]
## $Ratings[[3]]$Source
## [1] "Metacritic"
##
## $Ratings[[3]]$Value
## [1] "92/100"
##
##
##
## $Metascore
## [1] "92"
##
## $imdbRating
## [1] "8.0"
##
## $imdbVotes
## [1] "249,146"
##
## $imdbID
## [1] "tt0075686"
##
## $Type
## [1] "movie"
##
## $DVD
## [1] "N/A"
##
## $BoxOffice
## [1] "N/A"
##
## $Production
## [1] "Rollins-Joffe Productions"
##
## $Website
## [1] "N/A"
##
## $Response
## [1] "True"
Great! The fact that httr
converts the JSON response body automatically to an R list is very convenient.
Other data formats
API
jsonlite
From JSON to R
In the simplest setting, fromJSON() can convert character strings that represent JSON data into a nicely structured R list. Give it a try!
# Load the jsonlite package
library(jsonlite)
# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)
# Print structure of wine
str(wine)
## List of 5
## $ name : chr "Chateau Migraine"
## $ year : int 1997
## $ alcohol_pct: num 12.4
## $ color : chr "red"
## $ awarded : logi FALSE
Quandl API
As Filip showed in the video, fromJSON() also works if you pass a URL as a character string or the path to a local file that contains JSON data. Let’s try this out on the Quandl API, where you can fetch all sorts of financial and economical data.
# jsonlite is preloaded
# Definition of quandl_url
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
# Import Quandl data: quandl_data
quandl_data <- fromJSON(quandl_url)
# Print structure of quandl_data
str(quandl_data)
## List of 1
## $ dataset_data:List of 10
## ..$ limit : NULL
## ..$ transform : NULL
## ..$ column_index: NULL
## ..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
## ..$ start_date : chr "2012-05-18"
## ..$ end_date : chr "2018-03-27"
## ..$ frequency : chr "daily"
## ..$ data : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
## ..$ collapse : NULL
## ..$ order : NULL
Great! You successfully imported JSON data directly from the web. If you have a close look at the structure of quandl_data
, you’ll see that the data
element is a matrix.
OMDb API
In the video, you saw how easy it is to interact with an API once you know how to formulate requests. You also saw how to fetch all information on Rain Man from OMDb. Simply perform a GET() call, and next ask for the contents with the content() function. This content() function, which is part of the httr
package, uses jsonlite
behind the scenes to import the JSON data into R.
However, by now you also know that jsonlite
can handle URLs itself. Simply passing the request URL to fromJSON() will get your data into R. In this exercise, you will be using this technique to compare the release year of two movies in the Open Movie Database.
# The package jsonlite is already loaded
# Definition of the URLs
url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"
# Import two URLs with fromJSON(): sw4 and sw3
sw4 <- fromJSON(url_sw4)
sw3 <- fromJSON(url_sw3)
# Print out the Title element of both lists
sw4$Title
## [1] "Star Wars: Episode IV - A New Hope"
sw3$Title
## [1] "Star Wars: Episode III - Revenge of the Sith"
# Is the release year of sw4 later than sw3?
sw4$Year > sw3$Year
## [1] FALSE
Well done! The fourth episode of the Star Wars saga was released before the third one! Enough of the fromJSON() now. Let’s try to convert some R data into JSON format now.
Other jsonlite functions
toJSON()
prettify()
minify()
JSON practice (1) JSON is built on two structures: objects and arrays. To help you experiment with these, two JSON strings are included in the sample code. It’s up to you to change them appropriately and then call jsonlite
’s fromJSON() function on them each time.
# jsonlite is already loaded
# Challenge 1
json1 <- '[1, 2, 3, 4, 5, 6]'
fromJSON(json1)
## [1] 1 2 3 4 5 6
# Challenge 2
json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
fromJSON(json2)
## $a
## [1] 1 2 3
##
## $b
## [1] 4 5 6
JSON practice (2)
We prepared two more JSON strings in the sample code. Can you change them and call jsonlite
’s fromJSON() function on them, similar to the previous exercise?
# jsonlite is already loaded
# Challenge 1
json1 <- '[[1, 2], [3, 4]]'
fromJSON(json1)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
# Challenge 2
json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]'
fromJSON(json2)
## a b
## 1 1 2
## 2 3 4
## 3 5 6
Great! As you can see different JSON data structures will lead to different data structures in R
.
toJSON()
Apart from converting JSON to R with fromJSON(), you can also use toJSON() to convert R data to a JSON format. In its most basic use, you simply pass this function an R object to convert to a JSON. The result is an R object of the class json
, which is basically a character string representing that JSON.
For this exercise, you will be working with a .csv
file containing information on the amount of desalinated water that is produced around the world. As you’ll see, it contains a lot of missing values. This data can be found on the URL that is specified in the sample code.
# jsonlite is already loaded
# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"
# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)
# Convert the data file according to the requirements
water_json <- toJSON(water)
# Print out water_json
water_json
## [{"water":"Algeria","X1992":0.064,"X2002":0.017},{"water":"American Samoa"},{"water":"Angola","X1992":0.0001,"X2002":0.0001},{"water":"Antigua and Barbuda","X1992":0.0033},{"water":"Argentina","X1992":0.0007,"X1997":0.0007,"X2002":0.0007},{"water":"Australia","X1992":0.0298,"X2002":0.0298},{"water":"Austria","X1992":0.0022,"X2002":0.0022},{"water":"Bahamas","X1992":0.0013,"X2002":0.0074},{"water":"Bahrain","X1992":0.0441,"X2002":0.0441,"X2007":0.1024},{"water":"Barbados","X2007":0.0146},{"water":"British Virgin Islands","X2007":0.0042},{"water":"Canada","X1992":0.0027,"X2002":0.0027},{"water":"Cape Verde","X1992":0.002,"X1997":0.0017},{"water":"Cayman Islands","X1992":0.0033},{"water":"Central African Rep."},{"water":"Chile","X1992":0.0048,"X2002":0.0048},{"water":"Colombia","X1992":0.0027,"X2002":0.0027},{"water":"Cuba","X1992":0.0069,"X1997":0.0069,"X2002":0.0069},{"water":"Cyprus","X1992":0.003,"X1997":0.003,"X2002":0.0335},{"water":"Czech Rep.","X1992":0.0002,"X2002":0.0002},{"water":"Denmark","X1992":0.015,"X2002":0.015},{"water":"Djibouti","X1992":0.0001,"X2002":0.0001},{"water":"Ecuador","X1992":0.0022,"X1997":0.0022,"X2002":0.0022},{"water":"Egypt","X1992":0.025,"X1997":0.025,"X2002":0.1},{"water":"El Salvador","X1992":0.0001,"X2002":0.0001},{"water":"Finland","X1992":0.0001,"X2002":0.0001},{"water":"France","X1992":0.0117,"X2002":0.0117},{"water":"Gibraltar","X1992":0.0077},{"water":"Greece","X1992":0.01,"X2002":0.01},{"water":"Honduras","X1992":0.0002,"X2002":0.0002},{"water":"Hungary","X1992":0.0002,"X2002":0.0002},{"water":"India","X1997":0.0005,"X2002":0.0005},{"water":"Indonesia","X1992":0.0187,"X2002":0.0187},{"water":"Iran","X1992":0.003,"X1997":0.003,"X2002":0.003,"X2007":0.2},{"water":"Iraq","X1997":0.0074,"X2002":0.0074},{"water":"Ireland","X1992":0.0002,"X2002":0.0002},{"water":"Israel","X1992":0.0256,"X2002":0.0256,"X2007":0.14},{"water":"Italy","X1992":0.0973,"X2002":0.0973},{"water":"Jamaica","X1992":0.0005,"X1997":0.0005,"X2002":0.0005},{"water":"Japan","X1997":0.04,"X2002":0.04},{"water":"Jordan","X1997":0.002,"X2007":0.0098},{"water":"Kazakhstan","X1997":1.328,"X2002":1.328},{"water":"Kuwait","X1992":0.507,"X1997":0.231,"X2002":0.4202},{"water":"Lebanon","X2007":0.0473},{"water":"Libya","X2002":0.018},{"water":"Malaysia","X1992":0.0043,"X2002":0.0043},{"water":"Maldives","X1992":0.0004},{"water":"Malta","X1992":0.024,"X1997":0.031,"X2002":0.031},{"water":"Marshall Islands","X1992":0.0007},{"water":"Mauritania","X1992":0.002,"X2002":0.002},{"water":"Mexico","X1992":0.0307,"X2002":0.0307},{"water":"Morocco","X1992":0.0034,"X1997":0.0034,"X2002":0.007},{"water":"Namibia","X1992":0.0003,"X2002":0.0003},{"water":"Netherlands Antilles","X1992":0.063},{"water":"Nicaragua","X1992":0.0002,"X2002":0.0002},{"water":"Nigeria","X1992":0.003,"X2002":0.003},{"water":"Norway","X1992":0.0001,"X2002":0.0001},{"water":"Oman","X1997":0.034,"X2002":0.034,"X2007":0.109},{"water":"Peru","X1992":0.0054,"X2002":0.0054},{"water":"Poland","X1992":0.007,"X2002":0.007},{"water":"Portugal","X1992":0.0016,"X2002":0.0016},{"water":"Qatar","X1992":0.065,"X1997":0.099,"X2002":0.099,"X2007":0.18},{"water":"Saudi Arabia","X1992":0.683,"X1997":0.727,"X2002":0.863,"X2007":1.033},{"water":"Senegal","X1992":0,"X2002":0},{"water":"Somalia","X1992":0.0001,"X2002":0.0001},{"water":"South Africa","X1992":0.018,"X2002":0.018},{"water":"Spain","X1992":0.1002,"X2002":0.1002},{"water":"Sudan","X1992":0.0004,"X1997":0.0004,"X2002":0.0004},{"water":"Sweden","X1992":0.0002,"X2002":0.0002},{"water":"Trinidad and Tobago","X2007":0.036},{"water":"Tunisia","X1992":0.008,"X2002":0.013},{"water":"Turkey","X1992":0.0005,"X2002":0.0005,"X2007":0.0005},{"water":"United Arab Emirates","X1992":0.163,"X1997":0.385,"X2007":0.95},{"water":"United Kingdom","X1992":0.0333,"X2002":0.0333},{"water":"United States","X1992":0.58,"X2002":0.58},{"water":"Venezuela","X1992":0.0052,"X2002":0.0052},{"water":"Yemen, Rep.","X1992":0.01,"X2002":0.01}]
Congratulations! As you can see, the JSON you printed out isn’t easy to read. In the next exercise, you will print out some more JSONs, and explore ways to prettify or minify the output.
Minify and prettify
JSONs can come in different formats. Take these two JSONs, that are in fact exactly the same: the first one is in a minified format, the second one is in a pretty format with indentation, whitespace and new lines:
# Mini
{"a":1,"b":2,"c":{"x":5,"y":6}}
# Pretty
{
"a": 1,
"b": 2,
"c": {
"x": 5,
"y": 6
}
}
Unless you’re a computer, you surely prefer the second version. However, the standard form that toJSON() returns, is the minified version, as it is more concise. You can adapt this behavior by setting the pretty
argument inside toJSON() to TRUE
. If you already have a JSON string, you can use prettify() or minify() to make the JSON pretty or as concise as possible.
# jsonlite is already loaded
# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)
# Print pretty_json
#pretty_json
# Minify pretty_json: mini_json
mini_json <- minify(pretty_json)
# Print mini_json
mini_json
## [{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"},{"mpg":22.8,"cyl":4,"disp":108,"hp":93,"drat":3.85,"wt":2.32,"qsec":18.61,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Datsun 710"},{"mpg":21.4,"cyl":6,"disp":258,"hp":110,"drat":3.08,"wt":3.215,"qsec":19.44,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Hornet 4 Drive"},{"mpg":18.7,"cyl":8,"disp":360,"hp":175,"drat":3.15,"wt":3.44,"qsec":17.02,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Hornet Sportabout"},{"mpg":18.1,"cyl":6,"disp":225,"hp":105,"drat":2.76,"wt":3.46,"qsec":20.22,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Valiant"},{"mpg":14.3,"cyl":8,"disp":360,"hp":245,"drat":3.21,"wt":3.57,"qsec":15.84,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Duster 360"},{"mpg":24.4,"cyl":4,"disp":146.7,"hp":62,"drat":3.69,"wt":3.19,"qsec":20,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 240D"},{"mpg":22.8,"cyl":4,"disp":140.8,"hp":95,"drat":3.92,"wt":3.15,"qsec":22.9,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 230"},{"mpg":19.2,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.3,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280"},{"mpg":17.8,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.9,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280C"},{"mpg":16.4,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":4.07,"qsec":17.4,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SE"},{"mpg":17.3,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.73,"qsec":17.6,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SL"},{"mpg":15.2,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.78,"qsec":18,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SLC"},{"mpg":10.4,"cyl":8,"disp":472,"hp":205,"drat":2.93,"wt":5.25,"qsec":17.98,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Cadillac Fleetwood"},{"mpg":10.4,"cyl":8,"disp":460,"hp":215,"drat":3,"wt":5.424,"qsec":17.82,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Lincoln Continental"},{"mpg":14.7,"cyl":8,"disp":440,"hp":230,"drat":3.23,"wt":5.345,"qsec":17.42,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Chrysler Imperial"},{"mpg":32.4,"cyl":4,"disp":78.7,"hp":66,"drat":4.08,"wt":2.2,"qsec":19.47,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat 128"},{"mpg":30.4,"cyl":4,"disp":75.7,"hp":52,"drat":4.93,"wt":1.615,"qsec":18.52,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Honda Civic"},{"mpg":33.9,"cyl":4,"disp":71.1,"hp":65,"drat":4.22,"wt":1.835,"qsec":19.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Toyota Corolla"},{"mpg":21.5,"cyl":4,"disp":120.1,"hp":97,"drat":3.7,"wt":2.465,"qsec":20.01,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Toyota Corona"},{"mpg":15.5,"cyl":8,"disp":318,"hp":150,"drat":2.76,"wt":3.52,"qsec":16.87,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Dodge Challenger"},{"mpg":15.2,"cyl":8,"disp":304,"hp":150,"drat":3.15,"wt":3.435,"qsec":17.3,"vs":0,"am":0,"gear":3,"carb":2,"_row":"AMC Javelin"},{"mpg":13.3,"cyl":8,"disp":350,"hp":245,"drat":3.73,"wt":3.84,"qsec":15.41,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Camaro Z28"},{"mpg":19.2,"cyl":8,"disp":400,"hp":175,"drat":3.08,"wt":3.845,"qsec":17.05,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Pontiac Firebird"},{"mpg":27.3,"cyl":4,"disp":79,"hp":66,"drat":4.08,"wt":1.935,"qsec":18.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat X1-9"},{"mpg":26,"cyl":4,"disp":120.3,"hp":91,"drat":4.43,"wt":2.14,"qsec":16.7,"vs":0,"am":1,"gear":5,"carb":2,"_row":"Porsche 914-2"},{"mpg":30.4,"cyl":4,"disp":95.1,"hp":113,"drat":3.77,"wt":1.513,"qsec":16.9,"vs":1,"am":1,"gear":5,"carb":2,"_row":"Lotus Europa"},{"mpg":15.8,"cyl":8,"disp":351,"hp":264,"drat":4.22,"wt":3.17,"qsec":14.5,"vs":0,"am":1,"gear":5,"carb":4,"_row":"Ford Pantera L"},{"mpg":19.7,"cyl":6,"disp":145,"hp":175,"drat":3.62,"wt":2.77,"qsec":15.5,"vs":0,"am":1,"gear":5,"carb":6,"_row":"Ferrari Dino"},{"mpg":15,"cyl":8,"disp":301,"hp":335,"drat":3.54,"wt":3.57,"qsec":14.6,"vs":0,"am":1,"gear":5,"carb":8,"_row":"Maserati Bora"},{"mpg":21.4,"cyl":4,"disp":121,"hp":109,"drat":4.11,"wt":2.78,"qsec":18.6,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Volvo 142E"}]
Great! Hopefully you agree that the pretty format is way easier to read and understand than the minified format! This exercise concludes the course on importing data!
R packages to import data
haven
Import SAS data with haven
haven
is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:
read_sas()
read_dta()
(or read_stata()
, which are identical)read_sav()
or read_por()
, depending on the file type.All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven
will then automatically download the file for you before importing it.
You’ll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of 36 individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat
, which is available in your current working directory. You can also download the data here.
# Load the haven package
library(haven)
# Import sales.sas7bdat: sales
sales <- read_sas("_data/sales.sas7bdat")
# Display the structure of sales
str(sales)
## tibble [431 x 4] (S3: tbl_df/tbl/data.frame)
## $ purchase: num [1:431] 0 0 1 1 0 0 0 0 0 0 ...
## $ age : num [1:431] 41 47 41 39 32 32 33 45 43 40 ...
## $ gender : chr [1:431] "Female" "Female" "Female" "Female" ...
## $ income : chr [1:431] "Low" "Low" "Low" "Low" ...
## - attr(*, "label")= chr "SALES"
Congratulations! As you can see, read_sas() enables you to import SAS data files easily.
Import STATA data with haven
Next up are STATA data files; you can use read_dta() for these.
When inspecting the result of the read_dta()
call, you will notice that one column will be imported as a labelled
vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it’s best to change this data into a standard R class. To convert a variable of the class labelled
to a factor, you’ll need haven
’s as_factor() function.
In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found here
# haven is already loaded
# Import the data from the URL: sugar
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
# Structure of sugar
str(sugar)
## tibble [10 x 5] (S3: tbl_df/tbl/data.frame)
## $ Date : dbl+lbl [1:10] 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
## ..@ label : chr "Date"
## ..@ format.stata: chr "%9.0g"
## ..@ labels : Named num [1:10] 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr [1:10] "2004-12-31" "2005-12-31" "2006-12-31" "2007-12-31" ...
## $ Import : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
## ..- attr(*, "label")= chr "Import"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
## ..- attr(*, "label")= chr "Weight_I"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Export : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
## ..- attr(*, "label")= chr "Export"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
## ..- attr(*, "label")= chr "Weight_E"
## ..- attr(*, "format.stata")= chr "%9.0g"
## - attr(*, "label")= chr "Written by R."
# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
# Structure of sugar again
str(sugar)
## tibble [10 x 5] (S3: tbl_df/tbl/data.frame)
## $ Date : Date[1:10], format: "2013-12-31" "2012-12-31" ...
## $ Import : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
## ..- attr(*, "label")= chr "Import"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
## ..- attr(*, "label")= chr "Weight_I"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Export : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
## ..- attr(*, "label")= chr "Export"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
## ..- attr(*, "label")= chr "Weight_E"
## ..- attr(*, "format.stata")= chr "%9.0g"
## - attr(*, "label")= chr "Written by R."
Import SPSS data with haven
The haven
package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav
files - or read_por() - for .por
files.
In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological concept including, originally, five dimensions of personality to classify human personality. The SPSS dataset is called person.sav and is available in your working directory.
# haven is already loaded
# Import person.sav: traits
traits <- read_sav("_data/person.sav")
# Summarize traits
summary(traits)
## Neurotic Extroversion Agreeableness Conscientiousness
## Min. : 0.00 Min. : 5.00 Min. :15.00 Min. : 7.00
## 1st Qu.:18.00 1st Qu.:26.00 1st Qu.:39.00 1st Qu.:25.00
## Median :24.00 Median :31.00 Median :45.00 Median :30.00
## Mean :23.63 Mean :30.23 Mean :44.55 Mean :30.85
## 3rd Qu.:29.00 3rd Qu.:34.00 3rd Qu.:50.00 3rd Qu.:36.00
## Max. :44.00 Max. :65.00 Max. :73.00 Max. :58.00
## NA's :14 NA's :16 NA's :19 NA's :14
# Print out a subset
subset(traits, Extroversion > 40 & Agreeableness > 40)
## # A tibble: 8 x 4
## Neurotic Extroversion Agreeableness Conscientiousness
## <dbl> <dbl> <dbl> <dbl>
## 1 38 43 49 29
## 2 20 42 46 31
## 3 18 42 49 31
## 4 42 43 44 29
## 5 30 42 51 24
## 6 18 42 50 25
## 7 27 45 55 23
## 8 18 43 57 34
Great! You imported a data file from SPSS correctly using read_sav()
. Deepen your knowledge in the following exercise!
Factorize, round two
In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled
class. This is done to keep all the labelling information that was originally present in the .sav
and .por
files. It’s advised to coerce (or change) these variables to factors or other standard R classes.
The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found here
# haven is already loaded
# Import SPSS data from the URL: work
work <- read_sav("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav")
# Display summary of work$GENDER
summary(work$GENDER)
## Length Class Mode
## 474 character character
# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)
# Display summary of work$GENDER again
summary(work$GENDER)
## Female Male
## 216 258
Well done! The as_factor() function has more arguments you can specify, have a look at its documentation to discover more! Proceed to the next video.
SAS
.sas7bdat
.xport
sas7bdat
packageSTATA
read.dta()
SPSS
read.spss()
Import STATA data with foreign (1)
The foreign
package offers a simple function to import and read STATA data: read.dta().
In this exercise, you will import data on the US presidential elections in the year 2000. The data in florida.dta
contains the total numbers of votes for each of the four candidates as well as the total number of votes per election area in the state of Florida (Source: Florida Department of State). The file is available in your working directory, you can download it here if you want to experiment some more.
# Load the foreign package
library(foreign)
# Import florida.dta and name the resulting data frame florida
florida <- read.dta("_data/florida.dta")
# Check tail() of florida
tail(florida)
## gore bush buchanan nader total
## 62 2647 4051 27 59 6784
## 63 1399 2326 26 29 3780
## 64 97063 82214 396 2436 182109
## 65 3835 4511 46 149 8541
## 66 5637 12176 120 265 18198
## 67 2796 4983 88 93 7960
Congratulations! Your first exercise on using foreign
to import STATA data is complete. Proceed to the next exercise to learn the proper handling of the arguments that read.dta() includes!
Import STATA data with foreign (2)
Data can be very diverse, going from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() will come in handy.
The arguments you will use most often are convert.dates
, convert.factors
, missing.type
and convert.underscore
. Their meaning is pretty straightforward, as Filip explained in the video. It’s all about correctly converting STATA data to standard R data structures. Type ?read.dta
to find out about about the default values.
The dataset for this exercise contains socio-economic measures and access to education for different individuals (Source: World Bank). This data is available as edequality.dta, which is located in the _data
folder in your working directory.
# foreign is already loaded
# Specify the file path using file.path(): path
path <- file.path("_data", "edequality.dta")
# Create and print structure of edu_equal_1
edu_equal_1 <- read.dta(path)
str(edu_equal_1)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
## $ region : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity_head : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : Factor w/ 9 levels "head ",..: 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ income_mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr_ind_annual : num 347 347 347 1635 1635 ...
## $ educ_completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade_complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade_all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason_OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn_mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn_ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours_week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours_mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy_pension_amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr [1:27] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int [1:27] 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr [1:27] "" "" "location" "region" ...
## - attr(*, "var.labels")= chr [1:27] "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr [1:3] "_dta" "_svy_su1" "cluster"
## ..$ : chr [1:3] "_dta" "_svy_strata1" "strata"
## ..$ : chr [1:3] "_dta" "_svy_stages" "1"
## ..$ : chr [1:3] "_dta" "_svy_version" "2"
## ..$ : chr [1:3] "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr [1:3] "_dta" "ReS_Xij" "cons"
## ..$ : chr [1:3] "_dta" "ReS_str" "0"
## ..$ : chr [1:3] "_dta" "ReS_j" "group"
## ..$ : chr [1:3] "_dta" "ReS_ver" "v.2"
## ..$ : chr [1:3] "_dta" "ReS_i" "hhid dur"
## ..$ : chr [1:3] "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr [1:3] "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "urban location" "rural location"
## ..$ region : Named int [1:9] 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr [1:9] "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int [1:4] 1 2 3 4
## .. ..- attr(*, "names")= chr [1:4] "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "male" "female"
## ..$ s2_q3 : Named int [1:9] 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr [1:9] "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "no" "yes"
## ..$ : Named int [1:4] 1 2 3 4
## .. ..- attr(*, "names")= chr [1:4] "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "Not unemployed" "Unemployed"
## ..$ : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr [1:10] "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr [1:10] "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int [1:5] 1 2 3 4 5
## .. ..- attr(*, "names")= chr [1:5] "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "no" "yes"
# Create and print structure of edu_equal_2
edu_equal_2 <- read.dta(path, convert.factors = FALSE)
str(edu_equal_2)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : int 1 1 1 1 1 2 2 2 1 1 ...
## $ region : int 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity_head : int 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : int 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : int 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : int 1 2 2 2 2 2 2 2 2 2 ...
## $ income_mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr_ind_annual : num 347 347 347 1635 1635 ...
## $ educ_completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade_complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade_all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason_OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn_mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn_ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours_week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours_mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy_pension_amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr [1:27] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int [1:27] 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr [1:27] "" "" "location" "region" ...
## - attr(*, "var.labels")= chr [1:27] "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr [1:3] "_dta" "_svy_su1" "cluster"
## ..$ : chr [1:3] "_dta" "_svy_strata1" "strata"
## ..$ : chr [1:3] "_dta" "_svy_stages" "1"
## ..$ : chr [1:3] "_dta" "_svy_version" "2"
## ..$ : chr [1:3] "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr [1:3] "_dta" "ReS_Xij" "cons"
## ..$ : chr [1:3] "_dta" "ReS_str" "0"
## ..$ : chr [1:3] "_dta" "ReS_j" "group"
## ..$ : chr [1:3] "_dta" "ReS_ver" "v.2"
## ..$ : chr [1:3] "_dta" "ReS_i" "hhid dur"
## ..$ : chr [1:3] "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr [1:3] "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "urban location" "rural location"
## ..$ region : Named int [1:9] 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr [1:9] "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int [1:4] 1 2 3 4
## .. ..- attr(*, "names")= chr [1:4] "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "male" "female"
## ..$ s2_q3 : Named int [1:9] 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr [1:9] "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "no" "yes"
## ..$ : Named int [1:4] 1 2 3 4
## .. ..- attr(*, "names")= chr [1:4] "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "Not unemployed" "Unemployed"
## ..$ : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr [1:10] "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr [1:10] "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int [1:5] 1 2 3 4 5
## .. ..- attr(*, "names")= chr [1:5] "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "no" "yes"
# Create and print structure of edu_equal_3
edu_equal_3 <- read.dta(path, convert.underscore = TRUE)
str(edu_equal_3)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
## $ region : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity.head : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : Factor w/ 9 levels "head ",..: 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ income.mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr.ind.annual : num 347 347 347 1635 1635 ...
## $ educ.completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade.complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade.all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason.OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn.mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn.ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours.week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours.mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy.pension.amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr [1:27] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int [1:27] 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr [1:27] "" "" "location" "region" ...
## - attr(*, "var.labels")= chr [1:27] "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr [1:3] "_dta" "_svy_su1" "cluster"
## ..$ : chr [1:3] "_dta" "_svy_strata1" "strata"
## ..$ : chr [1:3] "_dta" "_svy_stages" "1"
## ..$ : chr [1:3] "_dta" "_svy_version" "2"
## ..$ : chr [1:3] "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr [1:3] "_dta" "ReS_Xij" "cons"
## ..$ : chr [1:3] "_dta" "ReS_str" "0"
## ..$ : chr [1:3] "_dta" "ReS_j" "group"
## ..$ : chr [1:3] "_dta" "ReS_ver" "v.2"
## ..$ : chr [1:3] "_dta" "ReS_i" "hhid dur"
## ..$ : chr [1:3] "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr [1:3] "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "urban location" "rural location"
## ..$ region : Named int [1:9] 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr [1:9] "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int [1:4] 1 2 3 4
## .. ..- attr(*, "names")= chr [1:4] "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "male" "female"
## ..$ s2_q3 : Named int [1:9] 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr [1:9] "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "no" "yes"
## ..$ : Named int [1:4] 1 2 3 4
## .. ..- attr(*, "names")= chr [1:4] "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "Not unemployed" "Unemployed"
## ..$ : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr [1:10] "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr [1:10] "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int [1:5] 1 2 3 4 5
## .. ..- attr(*, "names")= chr [1:5] "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "no" "yes"
Great! Can you tell the difference between the different versions of read.dta()? For this data, the first version (where you simply specified the file path) will be most useful to work with. Head over to the next exercise to see if you actually understand your data!
Import SPSS data with foreign (1)
All great things come in pairs. Where foreign
provided read.dta() to read Stata data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE
inside read.spss().
In this exercise, you’ll be working with socio-economic variables from different countries (Source: Quantative Data Analysis in Education). The SPSS data is in a file called international.sav
, which is in your working directory. You can also download it here if you want to play around with it some more.
# foreign is already loaded
# Import international.sav as a data frame: demo
demo <- read.spss("_data/international.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Create boxplot of gdp variable of demo
boxplot(demo$gdp)
Import SPSS data with foreign (2)
In the previous exercise, you used the to.data.frame
argument inside read.spss(). There are many other ways in which to customize the way your SPSS data is imported.
In this exercise you will experiment with another argument, use.value.labels
. It specifies whether variables with value labels should be converted into R factors with levels that are named accordingly. The argument is TRUE
by default which means that so called labelled variables inside SPSS are converted to factors inside R.
You’ll again be working with the international.sav data, which is available in your current working directory.
# foreign is already loaded
# Import international.sav as demo_1
demo_1 <- read.spss("_data/international.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Print out the head of demo_1
head(demo_1)
## id country contint m_illit f_illit lifeexpt gdp
## 1 1 Argentina Americas 3.0 3.0 16 3375
## 2 2 Benin Africa 45.2 74.5 7 521
## 3 3 Burundi Africa 33.2 48.1 5 86
## 4 4 Chile Americas 4.2 4.4 14 4523
## 5 5 Dominican Republic Americas 12.0 12.7 12 2408
## 6 6 El Salvador Americas 17.6 22.9 11 2302
# Import international.sav as demo_2
demo_2 <- read.spss("_data/international.sav", to.data.frame = TRUE, use.value.labels = FALSE)
## re-encoding from CP1252
# Print out the head of demo_2
head(demo_2)
## id country contint m_illit f_illit lifeexpt gdp
## 1 1 Argentina 2 3.0 3.0 16 3375
## 2 2 Benin 1 45.2 74.5 7 521
## 3 3 Burundi 1 33.2 48.1 5 86
## 4 4 Chile 2 4.2 4.4 14 4523
## 5 5 Dominican Republic 2 12.0 12.7 12 2408
## 6 6 El Salvador 2 17.6 22.9 11 2302
Congratulations on finishing part 2 of Importing Data in R! See you next time!